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Abstract 

Learning an input-output mapping from a set of examples, of the type that many 
neural networks have been constructed to perform, can be regarded as synthesizing an 
approximation of a multi-dimensional function. From this point of view, this form of 
learning is closely related to regularization theory. The theory developed in Poggio 
and Girosi (1989) shows the equivalence between regularization and a class of three- 
layer networks that we call regularization networks or Hyper Basis Functions. These 
networks are not only equivalent to generalized splines, but are also closely related to 
the classical Radial Basis Functions used for interpolation tasks and to several pattern 
recognition and neural network algorithms. In this note, we extend the theory by 
defining a general form of these networks with two sets of modifiable parameters in 
addition to the coefficients c a : moving centers and adjustable norm-weights. Moving 
the centers is equivalent to task-dependent clustering and changing the norm weights 
is equivalent to task-dependent dimensionality reduction. 
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1 Introduction 


In previous papers (Poggio and Girosi, 1989, 1990) we have shown the equiv¬ 
alence between regularization and a class of three-layer networks that we 
called regularization networks and that are related to the classical interpola¬ 
tion technique of Radial Basis Functions. 

Let S = {(xi,yi) G R n xR\i = 1 , ...iV} be a set of data that we want to ap¬ 
proximate by means of a function /. The regularization approach (Tikhonov, 
1963; Tikhonov and Arsenin, 1977; Morozov, 1984; Bertero, 1986) selects the 
function / that solves the variational problem of minimizing the functional 

m = Efe - /(x.)) 2 +aii/yii 2 (i) 

t=l 

where P is a constraint operator (usually a differential operator), || • || is 
a norm on the function space to whom Pf belongs (usually the L 2 norm) 
and A is a positive real number, the so called regularization parameter. The 
structure of the operator P, that is called “stabilizer”, embodies the a priori 
knowledge about the solution, and therefore depends on the nature of the 
particular problem that has to be solved (for instance, it is not needed in 
the case of P corresponding to a Gaussian or bell-shaped Green’s function). 
We have shown (Poggio and Girosi, 1989) that the solution of the variational 
problem (1) has the following simple form: 

/( x ) = ]C c * G ( x ; x ») + p( x ) 

i=1 

where G(x) is the Green’s function (Stakgold, 1979) of the self-adjoint dif¬ 
ferential operator PP, P being the adjoint operator of P, p(x) is a linear 
combination of functions that span the null space of P, and the coefficients 
Ci satisfy a linear system of equations that depend on the N “examples”, i.e. 
the data to be approximated. The form of the term p(x) depends on the sta¬ 
bilizer that has been chosen and on the boundary conditions, and therefore 
on the particular problem that has to be solved. For this reason, and since 
its inclusion does not modify the main conclusions, we will disregard it in the 
following. If P is an operator with radial symmetry, the Green’s function G 
is radial and therefore the approximating function becomes: 
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( 2 ) 


/(*) = Hc t G(||x-Xi|| 2 ), 

«=i 

which is a sum of radial functions, each with its center x t on a distinct data 
point. Thus the number of radial functions, and corresponding centers, is 
the same as the number of examples. 

In this note we indicate how to extend the technique into three natural 
directions: 

1. The computation of a solution of the form (2) has a complexity (number 
of radial functions) that is independent of the dimensionality of the 
input space but is on the order of the dimensionality of the training 
set (number of examples), which is usually high. We show how to 
justify in terms of the regularizationan framework an approximation 
of equation (2) in which the number of centers is much smaller than 
the number of examples and the positions of the centers are modified 
during learning (Poggio and Girosi, 1989). The key idea is to consider 
a specific form of an approximation to the solution of the standard 
regularization problem. Moving centers are equivalent to the free knots 
of nonlinear splines. In the context of networks they were first suggested 
as a potentially useful heuristics by Broomhead and Lowe (1988) and 
used by Moody and Darken (1989). 

2. It is natural to try to extend the form of the solution (2) by considering 
the superposition of different types of Green’s functions (Poggio and 
Girosi, 1989, 1990a) (for example basis functions of different scales). 
This extension is natural within the framework of regularization (and 
has a direct Bayesian interpretation) by considering a more general 
functional than equation (1) containing several stabilizers. We will 
show how the well-defined but underconstrained variational problem 
associated with the new functional can be transformed into an over¬ 
constrained problem. 

3. In equation (2) the norm ||x — x t j| may be considered as a weighted 
norm 


|x - x.ll^v = (x - x,) T W r W(x - x,') 
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where W is a square matrix and the superscript T indicates the trans¬ 
pose. In the simple case of diagonal W the diagonal elements wa assign 
a specific weight to each input coordinate, and the standard Euclidean 
norm is obtained when W is set to the identity matrix. They play 
a critical role whenever different types of inputs are present. We will 
show how the weighted norm idea can be derived from a slightly more 
general functional than equation (1). The associated variational prob¬ 
lem is well-defined but underconstrained; it can be transformed into an 
overconstrained problem by using a certain approximation technique. 

We call Hyper Basis Functions , in short HyperBFs , the most general form 
of regularization networks based on these three extensions. 

2 Moving Centers 

The solution given by standard regularization theory to the approximation 
problem can be very expensive in computational terms when the number of 
examples is very high. The computation of the coefficients of the expansion 
can become then a very time consuming operation: its complexity grows 
polynomially with N, (roughly as N 3 ) since an N x N matrix has to be in¬ 
verted. In addition, the probability of ill-conditioning is higher for larger and 
larger matrices (it grows like N 3 for a N x N uniformly distributed random 
matrix) (Demmel, 1987). We now show a way to reduce the complexity of the 
problem, introducing an approximation to the regularized solution. While 
the exact regularization solution is equivalent to generalized splines with fixed 
knots, the approximated solution is equivalent to generalized splines with free 
knots. 

2.1 An approximation to the regularization solution 

A standard technique, sometimes known as Galerkin’s method, that has been 
used to find approximate solutions of variational problems, is to expand the 
solution on a finite basis. The approximated solution /*(x) has then the 
following form: 


/*( x ) = S c .<Ai( x ) (3) 

i—i 
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where {<&}” =1 is a set of linearly independent functions (Mikhlin, 1965). The 
coefficients c, are usually found according to some rule that guarantees a 
minimum deviation from the true solution. In the case of standard regu¬ 
larization, when the functional to minimize is given by equation (1), this 
method gives the exact solution if n is equal to the numer of data points N, 
and {<^ t }" =1 = {G(x;x,•)}£!_!, where G is the Green’s function of the opera¬ 
tor PP. In this case the unknown coefficients of the expansion (3) can be 
obtained in a simple way by substituting expansion (3) in the regularization 
functional (1), that becomes a function H[f*] = H*(ci ,..., c/v), and then by 
minimizing H[f*] with respect to the coefficients, that is by setting: 


dH[f] 

dci 


= 0 * = 1,...,IV. 


(4) 


It can be easily shown (Poggio and Girosi, 1989) that, if the Green’s 
function vanishes on the boundary of the region that is considered, the set of 
equations (4) is a linear system whose solution gives the standard regulariza¬ 
tion coefficients. In more general cases the basis should be enlarged, 

to include terms that generate the null space of P, in order to obtain the 
correct solution. For simplicity, we disregard these terms in the following, 
since they do not change the main conclusions. A natural approximation to 
the exact solution will be then of the form: 


/*(*) = £ c a G(x; t a ) (5) 

a=1 

where the parameters t OJ that we call “centers”, and the coefficients c a are 
unknown, and are in general fewer than the data points (n < N). This form 
of solution has the desirable property of being an universal approximator for 
continuous functions (Girosi and Poggio, 1989) and to be the only choice 
that guarantees that in the case of n = IV and {t a }” =1 = {x;}" =1 the correct 
solution (of equation 1) is consistently recovered. We will see later in section 
(5) how to find the unknown parameters of this expansion. 


3 Different types of Basis Functions. 

This scheme can be further extended by considering in equation (5) the 
superposition of different types of functions G, such as Gaussians at different 
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scales. 

The function / to be approximated is regarded as the sum of p compo¬ 
nents / m , m = 1,... ,p, each component having a different prior probabil¬ 
ity. This assumption is clearly meaningful only if p « N. Therefore the 
functional H[f] to minimize will contain p stabilizers P m , p regularization 
parameters A m and will be written as 

m = -1 /”( x '» 2 +1 u \ p m r \\ 2 ■ w 

1=1 m=1 m=l 

The Euler-Lagrange equations associated with equation (6) have the form: 

p mpm r (x) = 1 ^ _ £ / -. (Xi))4(x _ X.) m = 1,... ,p. (7) 

A ™ i= 1 k= 1 

As in the case of standard regularization, the solution of equation (7) is 
a linear superposition of Green’s functions: 

r(x) = £>”G’“(x;x,) . (8) 

t=l 

The function F(x) that minimizes the functional H[f] is then a linear su¬ 
perposition of linear superpositions of the Green’s functions G m correspond¬ 
ing to the stabilizers P m , that is 

F(x) = £X>rG"”(x;x i ) +p(x), (9) 

m=l !=1 

where p(x) is a linear combination of functions that span the null spaces 
of the stabilizers. For instance, when G m (x) are Gaussian a polynomial is 
not needed, though it can always be added. For other Green’s functions the 
theory requires an appropriate p(x). 

Substitution of equation (8) in equation (7) yields a linear system for the 
coefficients c™. There is a simple relation between the coefficients associated 
to two different stabilizers, that is 

^m — G * — !)•••) A^, ^ — 1 i • • • i P‘ 
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This means that if a component / m (x) of the solution is given, the other 
p — 1 ones can be recovered by a simple scaling operation. This is expected, 
since the underlying variational problem is underconstrained: we are trying 
to obtain Np coefficients from a set of N data points. The form of the 
solution (9) is appealing: if all the coefficients c™ were independent and free 
to vary, the system could “choose” among different stabilizers, depending on 
the site. In order to retain the form (9) of the solution, while making the 
problem overconstrained instead of underconstrained, we choose a solution 
of the approximation problem of the following form (instead of equation 9): 

F(x)=-£r(x) + p(x), (10) 

771 —\ 

/"■(x) = E CG”(x; C) ( 11 ) 

a=l 

where (1 + d) K m < N and the coefficients c™ and the centers are 
unknowns. They can be found with a technique similar to the one described 
in section (5). Notice that equations (10) and (11) are of the same form as 
equation (5) and share its approximation properties. 

3.1 Multiple Scales. 

This method leads in particular to radial basis functions of multiple scales 
for the reconstruction of the function /. Suppose we know a priori that 
the function to be approximated has components on a number p of scales 
<T\,... ,a p : we can use this information to choose a set of p stabilizers whose 
Green’s functions are, for example, Gaussians of variance o\,... ,cr p . We have 
(Poggio and Girosi, 1989, 1990a) : 

OO . 

||P”7 m || 2 = £«? / dx(D k f k (x )) 2 

k =o JRn 

where D 2k — V 2k , D 2k+1 = VV 2fc and a™ = V being the gradient 
operator. As a result, the solution will be a superposition of superpositions 
of Gaussians of different variances. Of course, the Gaussians with large a 
should be preset, depending on the nature of the problem, to be fewer and 
therefore on a sparser grid, than the Gaussians with a small a. 
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The HyperBF method also yields non-radial Green’s functions - by using 
appropriate stabilizers - and also Green’s functions with a lower dimension¬ 
ality - by using the associated f m and P m in a suitable lower-dimensional 
subspace. Again this reflects a priori information that may be available about 
the nature of the mapping to be learned. In the latter case the information 
is that the mapping is of lower dimensionality or has lower dimensional com¬ 
ponents. 

4 Weighted norm 

The norm in equation (5) is usually intended as an Euclidean norm. If the 
components of x are of different types, it is natural to consider a weighted 
norm defined as 


||x||^ = x T W T Wx, 

since the relative scale of the components is otherwise arbitrary. The case 
in which the matrix W is known (from prior information) does not present 
any difficulty. It is interesting, however, to see what it means in terms of the 
underlying regularization principle. 

4.1 Weighted norm and regularization 

The regularization principle consists in finding the / that minimizes the 
functional: 


Hwlf] = £>i - /( x )) 2 + Wllw (12) 

1 = 1 

where we assume that P is radially symmetric in the variable y and that 
y = Wx (i.e. y is a known linear transformation of x that depends on the 
parameters W). This means that the smoothness constraint is given in a 
space that is an affine transformation of the original x space. The Green’s 
function associated with equation (12) is 

G (lly|| 2 ) = G (ll x ||w) (13) 

with Hxll^y = x r W T Wx. 
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Suppose now that the parameters W are unknown. We can formulate the 
problem of finding / and W that minimize the functional H\v(f). Notice 
that the relevant quantity is M = W r W, since W only appears in this 
form. The matrix M is symmetric and positive definite; it has therefore a 
unique, symmetric “square root” R, such that M = R r R = R 2 . One could 
chose to identify W with R. W would be therefore symmetric, with 
independent parameters. 

Thus finding the optimal W corresponds to finding the best stabilizer 
among those that are expressed in a coordinate system which is a linear 
transformation of the original one. The parameters W of the linear trans¬ 
formation become parameters of H with respect to which the functional is 
minimized. 

The simplest case is the case of W diagonal and G(x) = e~ x . In this 
case 

^(ll x ||w) = • • • e~ xW " 

and thus the components Wi of W are equivalent to the inverse of the variance 
a of each component of the multidimensional Gaussian. 

In the probabilistic interpretation of standard regularization (see Poggio 
and Girosi, 1989) the term A||P/|| 2 in the regularization functional corre¬ 
sponds to the following prior probability in a Bayesian formulation in which 
the MAP (Maximum A Posteriori) estimate is sought: 

Prob{f) = e- A " p/| l 2 . 

Our extension corresponds to choosing the stabilizer Pw = ||-f > /(y)|| 2 ? 
with y = Wx. The stabilizer P\y is parametrized by the matrix W and 
defines a prior Profe\y(/) which is also parametrized by W. 

The solution of the variational problem (12) has the form 

/(x) = E^(||x-x i ||U (14) 

t=i 

where the coefficients c,- and the elements of the matrix W must be estimated. 
Here again we are facing an underconstrained variational problem, since we 
trying to determine N + teL±41 parameters from N data points. The same 
considerations of section (3) apply: in order to transform the problem into 
an overconstrained problem, we look for a solution of the form 
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( 15 ) 


/'(x) = £<iG(l|x-t«llw) 

Q = 1 


5 How to learn centers’ positions and norm 
weights 


Suppose that we look for an approximated solution of the regularization prob¬ 
lem of the form (15). We now have the problem of finding the n coefficients 
c Q , the d x n coordinates of the centers t a and the d elements of the 
matrix M so that the expansion (12) is optimal. To avoid too many indeces, 
we will only consider here the case p = 1 in eq. 10. The extension is obvious. 
In this case we can use the natural definition of optimality given by the func¬ 
tional H. We then impose the condition that the set {c a , t a |a = 1,..., n} and 
the matrix M must be such that they minimizes H[f*], and the following 
equations must be satisfied: 


d H[f*} 
dc a 


= 0 , 


dH[r\ 

dt a 


dH[f*) 

dM 


= 0 , a — 1 , 


,n . 


Gradient-descent is probably the simplest approach for attempting to 
find the solution to this problem, though, of course, it is not guaranteed 
to converge. Several other iterative methods, such as versions of conjugate 
gradient and simulated annealing (Kirkpatrick et al., 1983) may be more 
efficient than gradient descent and should be used in practice. Since the 
function H [/*] to minimize is in general non-convex, a stochastic term in the 
gradient descent equations may be advisable to avoid local minima. In the 
stochastic gradient descent method the values of c a , t a and M that minimize 
H [/*] are regarded as the coordinates of the stable fixed point of the following 
stochastic dynamical system: 


dH[T], m . 

c a = -u— - + r} Q (t), a = 1,... 

uc a 

dHlf* 3 

ta - -<jJ —^-1- Oi = 1, . . 

Vjr dHlf*] 
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where n a (t ) and ft ( t ) are white noise of zero mean and a; is a parameter 

determining the microscopic timescale of the problem and is related to the 
rate of convergence to the fixed point. Defining 

A; = yi - /*(x) =y { -J2 Co,C?(||x t - - t„||w) 

Of=l 

and setting A = 0 for simplicity (the more general case can be approached in 
a similar way) in equation (1) we obtain 


mn = = E( a,) 2 . 

t=i 

The important quantities - that can be used in more efficient schemes 
than gradient descent - are, with ||x t - — t^H^y = (x t - — t a ) T M(x,- — t„) and 
M = W r W: 

• for the c a 


= -2 E A.Gdlxi - t„||V) ; (16) 

oc a l= 2 

• for the centers t a 

= 4c„f; AiG'dlx,- - t a ||yy)M(x s - - t a ) (17) 

at <* t=l 

• and for M 

= - 2 E E AiG'dix, - u W)c?,> (is) 

alV1 a =1 i=l 

where Qi jQ — (x,- — t a )(xi — t a ) T is a dyadic product and G' is the first 

derivative of G. 

Remarks 
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1. Instead of equation (18) for M the following equation can be used for 
W: 


= —4W £ c„ jr AiG'dlXi - t„||W)0,,<. (19) 

2. From equation (18) the matrix M is guaranteed to remain symmetric in 
a deterministic gradient descent scheme, since the right hand-side of the 
equation is symmetric (because the Qi a are correlation matrices and a 
linear combination of symmetric matrices is symmetric). Of course, the 
initial value must be a symmetric matrix and in the stochastic update 
scheme, the noise term must not break the symmetry. The matrix 
M must satisfy the additional constraint of remaining positive definite 
(since the scalar product x r Mx must be non-negative). We conjecture 
that equations (16), (17) and (18) conserve the positive definiteness of 
M if G is positive definite. 

3. Equation (16) has a simple interpretation: the correction is equal to 
the sum over the examples of the products between the error on that 
example and the “activity” of the “unit” that represents with its center 
that example. Notice that H[f*\ is quadratic in the coefficients c a , and 
if the centers and the matrix M are kept fixed, it can be shown (Poggio 
and Girosi, 1989) that the optimal coefficients are given by 

c = (G? G + \g)-'G T y (20) 

where we have defined (y),- = (c) Q = c a , ( G)i a = (j(x;;t a ) and 
(d)a/3 — G(t a ; t^g). If A is let go to zero, the matrix on the right side of 
equation (20) converges to the pseudoinverse of G (Albert, 1972), and if 
the Green’s function is radial the approximation method of Broomhead 
and Lowe (1988) is recovered. 

4. Equation (17) is similar to task-dependent clustering (Poggio and Girosi, 
1989). This can be best seen by assuming that A* are constant: then 
the gradient descent updating rule makes the centers move as a func¬ 
tion of the majority of the data, that is of the position of the clusters. 
In this case a technique similar to the k-means algorithm is recovered 
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(MacQueen, 1967; Moody and Darken, 1989). Equating dI $ 1 to zero 
we notice that, when the matrix M is set to the identity matrix, the 
optimal centers t a satisfy the following set of nonlinear equations: 


Ei P?*i 
E iP? 


a = 1,... ,n 


where Pf = A;G"(||x t —t Q || 2 ). The optimal centers are then a weighted 
sum of the data points. The weight P-* of the data point i for a given 
center t tt is high if the interpolation error A,- is high there and the radial 
basis function centered on that knot changes quickly in a neighborhood 
of the data point. This observation suggests faster update schemes, in 
which a suboptimal position of the centers is first found and then the 
c a are determined, similarly to the algorithm developed and tested 
successfully by Moody and Darken (1989). 

5. Equation (19) (by assuming that Ea=i c a A;Cr'(||x; — Mlw) is asymp¬ 
totically constant (!!)) contains the quantity E^Li Qi,a which is an 
estimate of the correlation matrix of all the examples relative to t a 
(modulus a normalization factor). Let us define C m , a as the d xm ma¬ 
trix whose columns are the vectors of the examples x x —1„,...,x m — t a . 
Then Ejli Qi,a can be written as Ejli Qi,a = C;v,aCjv a an d * s the d x d 
correlation matrix (d being the number of components of x). Interest¬ 
ingly, in this case, equation (19), when inserted in the gradient descent 
equation, has the form: 


W = -W Q 


which has the solution 


W (t) = W{0)e~ Qt = W(0) £ e-^ejej 

3 = 1 

where ej are the eigenvectors of Q and A j are the associated eigenvalues. 
All eigenvectors will decay to 0, the ones with the largest eigenvalues 
fastest. Since in the full equation the other terms such as A, will keep 
W from decaying to 0, we may expect that W will converge to a matrix 
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with rows that are similar to the eigenvectors of Q with the smallest 
eigenvalues. In other words, the equation should converge to rows of W 
that span the space orthogonal to the space spanned by the principal 
components of the input examples (i.e. the eigenvectors of Q with the 
largest eigenvalues). In this case, the matrix M is a projection operator 
that projects x into a space orthogonal to the space of the principal 
components. The principal components are the singular vectors of X, 
with the property that they span a nested set of optimal subspaces. 
This interpretation of the gradient descent equation is just a rough in¬ 
dication of what may happen, because of the very strong underlying 
assumptions. It turns out that in the object recognition case (Poggio 
and Edelmann, 1990), the interpretation is perfectly consistent with 
what one expects, given the (linear) computational theory underlying 
the problem (Basri and Ullmann, 1990; see also the appendix in Edel¬ 
mann and Poggio, 1990). Under orthographic projection, the vectors 
representing views of the same object span a linear subspace with a low 
dimension. Let us assume, according to the above discussion, that W 
projects a new input vector into a space orthogonal to the one spanned 
by the principal components extracted from many views of the object 
(the “examples”). Then, if the new input is another view of the same 
object, the result will be close to zero for all units. In the case of the 
Gaussian, for instance, this means that each unit will be maximally 
activated and by suitable choice of c any desired output may be syn¬ 
thesized. On the other hand, if the new input is the view of a different 
object, the result of operating on it with W will be different from 
zero and possibly large enough to give a very small activity of the unit 
making it impossible to synthesize a desired output by an appropriate 
choice of the c (the output will be zero or close to it). In this case, 
the appropriate W will solve the problem with just one center (since 
the problem is linear). Notice that if W is symmetric (i.e. if W is the 
square root of M), it has the same eigenvectors of M, and M and W 
have the same null space. 

6. One may think intuitively that it is desirable that W is space depen¬ 
dent, that is W = W(x). This assumption, however, seems rather 
meaningless from the point of view of regularization theory. As a con¬ 
sequence, we believe that it is wrong to assume W = W(x) in a scheme 
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such as HyperBF. On the other hand, it makes theoretically sense to 
use different HyperBF networks for different subsets of the domain of 
the given multivariate function, each one possibly with a different W. 
We do not have any theory, however, of how to partition appropriately 
the domain of the function. An alternative approach, that also makes 
sense, is local linear approximation. In this case one finds a set of local 
charts, somewhat similarly to computing W(x). 

5.1 A practical algorithm 

It seems natural to try to find a reasonable initial value for the parameters 
to start the minimization process. In the absence of more specific 
prior information the following heuristics seems reasonable. 

• Set the number of centers and set the centers’ positions to positions 
suggested by cluster analysis of the data (or more simply to a subset 
of the examples’ positions). 

• Set the rows of W to be vectors orthogonal to the eigenvectors with 
largest eigenvalues of £; Qi, a - 

• Use matrix pseudo-inversion to find the c a . 

• Use the t a , M = W r W and c a found so far as initial values for gradient 
descent equations. 

It should be noticed than an even more general strategy makes sense in 
some cases. Suppose that the system can be made to operate satisfactorily 
with the steps above or perhaps just with the first step. Suppose also that the 
system can continue to accumulate examples while operating. An example 
could be an autonomous vehicle that can improve, say, the model of its 
dynamics by collecting appropriate example pairs while operating. Then it 
makes sense to perform dimensionality reduction and to move the centers as 
outlined above. As an additional step one may try to eliminate features that 
receive little weight, if possible, and then to add other features while keeping 
the previously found centers. This is equivalent to adding centers of higher 
dimensionality. Another iteration of moving centers, finding norm weights, 
eliminating features and centers then takes place. 
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Experiments with movable centres and movable weights have been per¬ 
formed in the context of object recognition (Poggio and Edelman, 1990; Edel- 
man and Poggio, 1990) and approximation of multivariate functions (Caprile, 
Girosi and Poggio, 1990) and in both cases the results are promising. 

6 Remarks 

1. Equation (19) is similar to an operation of (task-dependent) dimension¬ 
ality reduction (Duda and Hart, 1973) whereas equation (17) is similar 
to a clustering process. 

2. It is conceivable that learning the weights of the norm is even more 
important than learning the centers and that in many cases it may be 
preferable to set the centers to a representative subset of the data and 
to keep them fixed thereafter. 

3. A specific matrix W corresponds to a specific metric in the multidi¬ 
mensional input space: W projects the input vector into the subspace 
spanned by its rows. In the case of the rows of W spanning the space 
orthogonal to the principal components of the inputs, W assigns a 
metric ellipsoid with the largest axes (corresponding to a large a in 
the Gaussian) along the principal components and the small axis (cor¬ 
responding to a small a in the Gaussian) orthogonal to it: thus even 
vectors that are far away (in the ordinary euclidean metric) are close in 
this metric if they lie in the hyperplane of the principal components and 
even close vectors (in the ordinary metric) are far away in the metric 
induced by W if they are orthogonal to the principal components. 

4. In the case of N examples, n = N fixed centers and M = /, there are 
enough data to constrain the N coefficients c a to be found. Moving 
centers add another nd parameters (d is the number of input compo¬ 
nents) and the matrix M another independent parameters. Thus 
the number of examples N must be sufficiently large to constrain ad¬ 
equately the free parameters - n d-dimensional centers, n coefficients 
c Q and independent entries of the matrix M. Thus 
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N » n + nd -\ --—. 

5. In the case of Gaussian basis functions, learning the entries of a diagonal 
W is equivalent to learning the variances of each two-dimensional (or 
one-dimensional) Gaussian receptive field for each center. It is clear 
that sets of units with different scales (see section 3.1) correspond to 
sets of units with different W. 
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